Abstract

Policy gradient methods are an appealing approach in reinforcement learning because
they directly optimize the cumulative reward and can straightforwardly be
used with nonlinear function approximators such as neural networks. The two
main challenges are the large number of samples typically required, and the difficulty
of obtaining stable and steady improvement despite the nonstationarity of the
incoming data. We address the first challenge by using value functions to substantially
reduce the variance of policy gradient estimates at the cost of some bias, with
an exponentially-weighted estimator of the advantage function that is analogous
to TD($\lambda$). We address the second challenge by using a trust region optimization
procedure for both the policy and the value function, which are represented by
neural networks.

Our approach yields strong empirical results on highly challenging 3D locomotion
tasks, learning running gaits for bipedal and quadrupedal simulated robots,
and learning a policy for getting the biped to stand up from starting out lying on
the ground. In contrast to a body of prior work that uses hand-crafted policy
representations, our neural network policies map directly from raw kinematics to joint
torques. Our algorithm is fully model-free, and the amount of simulated experience
required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real
time.


1 Introduction 

The typical problem formulation in reinforcement learning is to maximize the expected total reward 
of a policy. A key source of difficulty is the long time delay between actions and their positive or 
negative effect on rewards; this issue is called the credit assignment problem in the reinforcement 
learning literature (Minsky, 1961; Sutton & Barto, 1998), and the distal reward problem in the 
behavioral literature (Hull, 1943). Value functions offer an elegant solution to the credit assignment 
problem—they allow us to estimate the goodness of an action before the delayed reward arrives. 
Reinforcement learning algorithms make use of value functions in a variety of different ways; this 
paper considers algorithms that optimize a parameterized policy and use value functions to help 
estimate how the policy should be improved. 

When using a parameterized stochastic policy, it is possible to obtain an unbiased estimate of the 
gradient of the expected total returns (Williams, 1992; Sutton et al., 1999; Baxter & Bartlett, 2000); 
these noisy gradient estimates can be used in a stochastic gradient ascent algorithm. Unfortunately, 
the variance of the gradient estimator scales unfavorably with the time horizon, since the effect of 
an action is confounded with the effects of past and future actions. Another class of policy gradient 
algorithms, called actor-critic methods, uses a value function rather than the empirical returns,
obtaining an estimator with lower variance at the cost of introducing bias (Konda & Tsitsiklis, 2003;
Hafner & Riedmiller, 2011). But while high variance necessitates using more samples, bias is more
pernicious: even with an unlimited number of samples, bias can cause the algorithm to fail to
converge, or to converge to a poor solution that is not even a local optimum.

We propose a family of policy gradient estimators that significantly reduce variance while
maintaining a tolerable level of bias. We call this estimation scheme, parameterized by $\gamma \in [0,1]$ and
$\lambda \in [0,1]$, the generalized advantage estimator (GAE). Related methods have been proposed in the
context of online actor-critic methods (Kimura & Kobayashi, 1998; Wawrzynski, 2009). We provide
a more general analysis, which is applicable in both the online and batch settings, and discuss an
interpretation of our method as an instance of reward shaping (Ng et al., 1999), where the approximate
value function is used to shape the reward.

We present experimental results on a number of highly challenging 3D locomotion tasks, where 
we show that our approach can learn complex gaits using high-dimensional, general purpose neural 
network function approximators for both the policy and the value function, each with over $10^4$
parameters. The policies perform torque-level control of simulated 3D robots with up to 33 state 
dimensions and 10 actuators. 

The contributions of this paper are summarized as follows: 

1. We provide justification and intuition for an effective variance reduction scheme for policy
gradients, which we call generalized advantage estimation (GAE). While the formula has been
proposed in prior work (Kimura & Kobayashi, 1998; Wawrzynski, 2009), our analysis is novel and
enables GAE to be applied with a more general set of algorithms, including the batch trust-region
algorithm we use for our experiments.

2. We propose the use of a trust region optimization method for the value function, which we find is 
a robust and efficient way to train neural network value functions with thousands of parameters. 

3. By combining (1) and (2) above, we obtain an algorithm that empirically is effective at learning 
neural network policies for challenging control tasks. The results extend the state of the art in 
using reinforcement learning for high-dimensional continuous control. Videos are available at 

https://sites.google.com/site/gaepapersupp. 

2 Preliminaries 

We consider an undiscounted formulation of the policy optimization problem. The initial state
$s_0$ is sampled from distribution $\rho_0$. A trajectory $(s_0, a_0, s_1, a_1, \dots)$ is generated by sampling
actions according to the policy $a_t \sim \pi(a_t \mid s_t)$ and sampling the states according to the dynamics
$s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$, until a terminal (absorbing) state is reached. A reward $r_t = r(s_t, a_t, s_{t+1})$
is received at each timestep. The goal is to maximize the expected total reward $\sum_{t=0}^{\infty} r_t$, which is
assumed to be finite for all policies. Note that we are not using a discount as part of the problem
specification; it will appear below as an algorithm parameter that adjusts a bias-variance tradeoff. But
the discounted problem (maximizing $\sum_{t=0}^{\infty} \gamma^t r_t$) can be handled as an instance of the undiscounted
problem in which we absorb the discount factor into the reward function, making it time-dependent.

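Policy gradient methods maximize the expected total reward by repeatedly estimating the gradient
$g := \nabla_\theta \mathbb{E}\left[\sum_{t=0}^{\infty} r_t\right]$. There are several different related expressions for the policy gradient, which
have the form

$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \Psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right], \tag{1}$$

where $\Psi_t$ may be one of the following:

1. $\sum_{t=0}^{\infty} r_t$: total reward of the trajectory.

2. $\sum_{t'=t}^{\infty} r_{t'}$: reward following action $a_t$.

3. $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$: baselined version of previous formula.

4. $Q^\pi(s_t, a_t)$: state-action value function.

5. $A^\pi(s_t, a_t)$: advantage function.

6. $r_t + V^\pi(s_{t+1}) - V^\pi(s_t)$: TD residual.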


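The latter formulas use the definitions

$$V^\pi(s_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t:\infty}}\left[\sum_{l=0}^{\infty} r_{t+l}\right], \qquad Q^\pi(s_t, a_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\left[\sum_{l=0}^{\infty} r_{t+l}\right] \tag{2}$$

$$A^\pi(s_t, a_t) := Q^\pi(s_t, a_t) - V^\pi(s_t), \quad \text{(Advantage function)}. \tag{3}$$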
Here, the subscript of $\mathbb{E}$ enumerates the variables being integrated over, where states and actions are
sampled sequentially from the dynamics model $P(s_{t+1} \mid s_t, a_t)$ and policy $\pi(a_t \mid s_t)$, respectively.
The colon notation $a : b$ refers to the inclusive range $(a, a+1, \dots, b)$. These formulas are well
known and straightforward to obtain; they follow directly from Proposition 1, which will be stated
shortly.

The choice $\Psi_t = A^\pi(s_t, a_t)$ yields almost the lowest possible variance, though in practice, the
advantage function is not known and must be estimated. This statement can be intuitively justified by
the following interpretation of the policy gradient: a step in the policy gradient direction should
increase the probability of better-than-average actions and decrease the probability of
worse-than-average actions. The advantage function, by its definition $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, measures
whether the action is better or worse than the policy's default behavior. Hence, we should
choose $\Psi_t$ to be the advantage function $A^\pi(s_t, a_t)$, so that the gradient term $\Psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
points in the direction of increased $\pi_\theta(a_t \mid s_t)$ if and only if $A^\pi(s_t, a_t) > 0$. See Greensmith et al.
(2004) for a more rigorous analysis of the variance of policy gradient estimators and the effect of
using a baseline.


We will introduce a parameter $\gamma$ that allows us to reduce variance by downweighting rewards
corresponding to delayed effects, at the cost of introducing bias. This parameter corresponds to the
discount factor used in discounted formulations of MDPs, but we treat it as a variance reduction
parameter in an undiscounted problem; this technique was analyzed theoretically by Marbach &
Tsitsiklis (2003); Kakade (2001b); Thomas (2014). The discounted value functions are given by:

$$V^{\pi,\gamma}(s_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t:\infty}}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right], \qquad Q^{\pi,\gamma}(s_t, a_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right] \tag{4}$$

$$A^{\pi,\gamma}(s_t, a_t) := Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t). \tag{5}$$


The discounted approximation to the policy gradient is defined as follows:

$$g^\gamma := \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} A^{\pi,\gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]. \tag{6}$$

The following section discusses how to obtain biased (but not too biased) estimators for $A^{\pi,\gamma}$, giving
us noisy estimates of the discounted policy gradient in Equation (6).


Before proceeding, we will introduce the notion of a $\gamma$-just estimator of the advantage function,
which is an estimator that does not introduce bias when we use it in place of $A^{\pi,\gamma}$ (which is not
known and must be estimated) in Equation (6) to estimate $g^\gamma$.^1 Consider an advantage estimator
$\hat{A}_t(s_{0:\infty}, a_{0:\infty})$, which may in general be a function of the entire trajectory.

Definition 1. The estimator $\hat{A}_t$ is $\gamma$-just if

$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\hat{A}_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[A^{\pi,\gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]. \tag{7}$$


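It follows immediately that if $\hat{A}_t$ is $\gamma$-just for all $t$, then

$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = g^\gamma. \tag{8}$$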


One sufficient condition for $\hat{A}_t$ to be $\gamma$-just is that $\hat{A}_t$ decomposes as the difference between two
functions $Q_t$ and $b_t$, where $Q_t$ can depend on any trajectory variables but gives an unbiased estimator
of the $\gamma$-discounted Q-function, and $b_t$ is an arbitrary function of the states and actions sampled
before $a_t$.

Proposition 1. Suppose that $\hat{A}_t$ can be written in the form $\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{t:\infty}, a_{t:\infty}) -
b_t(s_{0:t}, a_{0:t-1})$ such that for all $(s_t, a_t)$, $\mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty} \mid s_t, a_t}\left[Q_t(s_{t:\infty}, a_{t:\infty})\right] = Q^{\pi,\gamma}(s_t, a_t)$.

Then $\hat{A}$ is $\gamma$-just.

^1 Note that we have already introduced bias by using $A^{\pi,\gamma}$ in place of $A^\pi$; here we are concerned with
obtaining an unbiased estimate of $g^\gamma$, which is a biased estimate of the policy gradient of the undiscounted
MDP.
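The proof is provided in Appendix B. It is easy to verify that the following expressions are $\gamma$-just
advantage estimators for $\hat{A}_t$:

- $\sum_{l=0}^{\infty} \gamma^l r_{t+l}$
- $Q^{\pi,\gamma}(s_t, a_t)$
- $A^{\pi,\gamma}(s_t, a_t)$
- $r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t)$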




3 Advantage function estimation 

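This section will be concerned with producing an accurate estimate $\hat{A}_t$ of the discounted advantage
function $A^{\pi,\gamma}(s_t, a_t)$, which will then be used to construct a policy gradient estimator of the
following form:

$$\hat{g} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{\infty} \hat{A}_t^n \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n) \tag{9}$$

where $n$ indexes over a batch of episodes.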

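Let $V$ be an approximate value function. Define $\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$, i.e., the TD residual
of $V$ with discount $\gamma$ (Sutton & Barto, 1998). Note that $\delta_t^V$ can be considered as an estimate of the
advantage of the action $a_t$. In fact, if we have the correct value function $V = V^{\pi,\gamma}$, then it is a $\gamma$-just
advantage estimator, and in fact, an unbiased estimator of $A^{\pi,\gamma}$:

$$\mathbb{E}_{s_{t+1}}\left[\delta_t^{V^{\pi,\gamma}}\right] = \mathbb{E}_{s_{t+1}}\left[r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t)\right] = \mathbb{E}_{s_{t+1}}\left[Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t)\right] = A^{\pi,\gamma}(s_t, a_t). \tag{10}$$

However, this estimator is only $\gamma$-just for $V = V^{\pi,\gamma}$; otherwise it will yield biased policy gradient
estimates.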

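Next, let us consider taking the sum of $k$ of these $\delta$ terms, which we will denote by $\hat{A}_t^{(k)}$:

$$\hat{A}_t^{(1)} := \delta_t^V = -V(s_t) + r_t + \gamma V(s_{t+1}) \tag{11}$$

$$\hat{A}_t^{(2)} := \delta_t^V + \gamma \delta_{t+1}^V = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) \tag{12}$$

$$\hat{A}_t^{(3)} := \delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3}) \tag{13}$$

$$\hat{A}_t^{(k)} := \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^V = -V(s_t) + r_t + \gamma r_{t+1} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) \tag{14}$$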


These equations result from a telescoping sum, and we see that $\hat{A}_t^{(k)}$ involves a $k$-step estimate of
the returns, together with a baseline term $-V(s_t)$. Analogously to the case of $\delta_t^V = \hat{A}_t^{(1)}$, we can consider
$\hat{A}_t^{(k)}$ to be an estimator of the advantage function, which is only $\gamma$-just when $V = V^{\pi,\gamma}$. However,
note that the bias generally becomes smaller as $k \to \infty$, since the term $\gamma^k V(s_{t+k})$ becomes more
heavily discounted, and the term $-V(s_t)$ does not affect the bias. Taking $k \to \infty$, we get

$$\hat{A}_t^{(\infty)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V = -V(s_t) + \sum_{l=0}^{\infty} \gamma^l r_{t+l}, \tag{15}$$


which is simply the empirical returns minus the value function baseline. 
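
The telescoping identity behind Equations (11)-(14) can be checked numerically. The following is a minimal sketch, assuming a single finite trajectory stored as arrays; the function names and the array layout (`values` holds one more entry than `rewards`, so that it includes the value of the final state) are illustrative choices, not part of the paper.

```python
import numpy as np

def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for every step of one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def k_step_advantage(rewards, values, gamma, t, k):
    """k-step estimator written as a discounted sum of k TD residuals."""
    deltas = td_residuals(rewards, values, gamma)
    discounts = gamma ** np.arange(k)
    return float(np.dot(discounts, deltas[t:t + k]))

def k_step_advantage_direct(rewards, values, gamma, t, k):
    """Same estimator written as -V(s_t) + k-step return + gamma^k * V(s_{t+k})."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(k)
    k_step_return = float(np.dot(discounts, rewards[t:t + k]))
    return -values[t] + k_step_return + gamma ** k * values[t + k]
```

Both functions require `t + k <= len(rewards)`; the two forms agree up to floating-point error because the intermediate value terms cancel.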
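The generalized advantage estimator GAE($\gamma$, $\lambda$) is defined as the exponentially-weighted average
of these $k$-step estimators:

$$
\begin{aligned}
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} &:= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V + \lambda(\delta_t^V + \gamma \delta_{t+1}^V) + \lambda^2(\delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V (1 + \lambda + \lambda^2 + \dots) + \gamma \delta_{t+1}^V (\lambda + \lambda^2 + \lambda^3 + \dots) + \gamma^2 \delta_{t+2}^V (\lambda^2 + \lambda^3 + \lambda^4 + \dots) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V \frac{1}{1-\lambda} + \gamma \delta_{t+1}^V \frac{\lambda}{1-\lambda} + \gamma^2 \delta_{t+2}^V \frac{\lambda^2}{1-\lambda} + \dots\right) \\
&= \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V
\end{aligned} \tag{16}
$$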

From Equation (16), we see that the advantage estimator has a remarkably simple formula involving
a discounted sum of Bellman residual terms. Section 4 discusses an interpretation of this formula as
the returns in an MDP with a modified reward function. The construction we used above is closely
analogous to the one used to define TD($\lambda$) (Sutton & Barto, 1998); however, TD($\lambda$) is an estimator
of the value function, whereas here we are estimating the advantage function.

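There are two notable special cases of this formula, obtained by setting $\lambda = 0$ and $\lambda = 1$.

$$\mathrm{GAE}(\gamma, 0): \quad \hat{A}_t := \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \tag{17}$$

$$\mathrm{GAE}(\gamma, 1): \quad \hat{A}_t := \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t) \tag{18}$$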

GAE($\gamma$, 1) is $\gamma$-just regardless of the accuracy of $V$, but it has high variance due to the sum of
terms. GAE($\gamma$, 0) is $\gamma$-just for $V = V^{\pi,\gamma}$ and otherwise induces bias, but it typically has much
lower variance. The generalized advantage estimator for $0 < \lambda < 1$ makes a compromise between
bias and variance, controlled by parameter $\lambda$.
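
For a finite trajectory, the discounted sum of TD residuals in Equation (16) can be computed with a single backward pass. The sketch below is one way to do this, assuming the episode ends in a terminal state whose value is zero (so the infinite sum truncates); the function name and array layout are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), one more entry than rewards (assumed layout);
             V(s_T) should be 0 when s_T is terminal.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros_like(deltas)
    running = 0.0
    # A_t = delta_t + (gamma * lam) * A_{t+1}, evaluated from the last step backward
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0.0` this returns the TD residuals of Equation (17); with `lam=1.0` and a zero terminal value it returns the discounted returns minus the value baseline, as in Equation (18).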

We have described an advantage estimator with two separate parameters $\gamma$ and $\lambda$, both of which
contribute to the bias-variance tradeoff when using an approximate value function. However, they serve
different purposes and work best with different ranges of values. $\gamma$ most importantly determines the
scale of the value function $V^{\pi,\gamma}$, which does not depend on $\lambda$. Taking $\gamma < 1$ introduces bias into
the policy gradient estimate, regardless of the value function's accuracy. On the other hand, $\lambda < 1$
introduces bias only when the value function is inaccurate. Empirically, we find that the best value
of $\lambda$ is much lower than the best value of $\gamma$, likely because $\lambda$ introduces far less bias than $\gamma$ for a
reasonably accurate value function.

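Using the generalized advantage estimator, we can construct a biased estimator of the discounted
policy gradient from Equation (6):

$$g^\gamma \approx \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V\right], \tag{19}$$

where equality holds when $\lambda = 1$.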


4 Interpretation as Reward Shaping 

In this section, we discuss how one can interpret $\lambda$ as an extra discount factor applied after
performing a reward shaping transformation on the MDP. We also introduce the notion of a response
function to help understand the bias introduced by $\gamma$ and $\lambda$.

Reward shaping (Ng et al., 1999) refers to the following transformation of the reward function of
an MDP: let $\Phi : \mathcal{S} \to \mathbb{R}$ be an arbitrary scalar-valued function on state space, and define the
transformed reward function $\tilde{r}$ by

$$\tilde{r}(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s), \tag{20}$$
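which in turn defines a transformed MDP. This transformation leaves the discounted advantage
function $A^{\pi,\gamma}$ unchanged for any policy $\pi$. To see this, consider the discounted sum of rewards of a
trajectory starting with state $s_t$:

$$\sum_{l=0}^{\infty} \gamma^l \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l}, s_{t+l+1}) - \Phi(s_t). \tag{21}$$

Letting $\tilde{Q}^{\pi,\gamma}$, $\tilde{V}^{\pi,\gamma}$, $\tilde{A}^{\pi,\gamma}$ be the value and advantage functions of the transformed MDP, one obtains
from the definitions of these quantities that

$$\tilde{Q}^{\pi,\gamma}(s, a) = Q^{\pi,\gamma}(s, a) - \Phi(s) \tag{22}$$

$$\tilde{V}^{\pi,\gamma}(s) = V^{\pi,\gamma}(s) - \Phi(s) \tag{23}$$

$$\tilde{A}^{\pi,\gamma}(s, a) = \left(Q^{\pi,\gamma}(s, a) - \Phi(s)\right) - \left(V^{\pi,\gamma}(s) - \Phi(s)\right) = A^{\pi,\gamma}(s, a). \tag{24}$$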


Note that if $\Phi$ happens to be the state-value function $V^{\pi,\gamma}$ from the original MDP, then the
transformed MDP has the interesting property that $\tilde{V}^{\pi,\gamma}(s)$ is zero at every state.
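
The shaping identity for discounted sums can be verified on a sampled trajectory. This is a small sketch under the assumption that the trajectory is finite and that the shaping function is zero at the final state, so no boundary term remains; `phi` here is just an array of hypothetical shaping values along the trajectory.

```python
import numpy as np

def discounted_sum(xs, gamma):
    """sum_l gamma^l * xs[l] over a finite sequence."""
    xs = np.asarray(xs, dtype=float)
    return float(np.dot(gamma ** np.arange(len(xs)), xs))

def shaped_rewards(rewards, phi, gamma):
    """Shaped reward r_t + gamma * Phi(s_{t+1}) - Phi(s_t).

    phi holds Phi(s_0) .. Phi(s_T), one more entry than rewards (assumed layout).
    """
    rewards = np.asarray(rewards, dtype=float)
    phi = np.asarray(phi, dtype=float)
    return rewards + gamma * phi[1:] - phi[:-1]

# Example: the discounted shaped return equals the original return minus Phi(s_0).
rewards = [1.0, 0.5, -0.2, 2.0]
phi = [0.3, 0.1, 0.7, -0.4, 0.0]   # Phi at the final state is 0 (assumption)
gamma = 0.9
lhs = discounted_sum(shaped_rewards(rewards, phi, gamma), gamma)
rhs = discounted_sum(rewards, gamma) - phi[0]
```

If the final shaping value were nonzero, a leftover term `gamma**T * phi[T]` would appear on the right-hand side for a trajectory of length `T`.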

Note that Ng et al. (1999) showed that the reward shaping transformation leaves the policy gradient
and optimal policy unchanged when our objective is to maximize the discounted sum of rewards
$\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1})$. In contrast, this paper is concerned with maximizing the undiscounted sum
of rewards, where the discount $\gamma$ is used as a variance-reduction parameter.

Having reviewed the idea of reward shaping, let us consider how we could use it to get a policy
gradient estimate. The most natural approach is to construct policy gradient estimators that use
discounted sums of shaped rewards $\tilde{r}$. However, Equation (21) shows that we obtain the discounted
sum of the original MDP's rewards $r$ minus a baseline term. Next, let's consider using a "steeper"
discount $\gamma\lambda$, where $0 \le \lambda \le 1$. It's easy to see that the shaped reward $\tilde{r}$ equals the Bellman residual
term $\delta^V$, introduced in Section 3, where we set $\Phi = V$. Letting $\Phi = V$, we see that

$$\sum_{l=0}^{\infty} (\gamma\lambda)^l \tilde{r}(s_{t+l}, a_{t+l}, s_{t+l+1}) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V = \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)}. \tag{25}$$

Hence, by considering the $\gamma\lambda$-discounted sum of shaped rewards, we exactly obtain the generalized
advantage estimators from Section 3. As shown previously, $\lambda = 1$ gives an unbiased estimate of $g^\gamma$,
whereas $\lambda < 1$ gives a biased estimate.

To further analyze the effect of this shaping transformation and parameters $\gamma$ and $\lambda$, it will be useful
to introduce the notion of a response function $\chi$, which we define as follows:

$$\chi(l; s_t, a_t) = \mathbb{E}\left[r_{t+l} \mid s_t, a_t\right] - \mathbb{E}\left[r_{t+l} \mid s_t\right]. \tag{26}$$

Note that $A^{\pi,\gamma}(s, a) = \sum_{l=0}^{\infty} \gamma^l \chi(l; s, a)$, hence the response function decomposes the advantage
function across timesteps. The response function lets us quantify the temporal credit assignment
problem: long range dependencies between actions and rewards correspond to nonzero values of the
response function for $l \gg 0$.

Next, let us revisit the discount factor $\gamma$ and the approximation we are making by using $A^{\pi,\gamma}$ rather
than $A^{\pi,1}$. The discounted policy gradient estimator from Equation (6) has a sum of terms of the
form

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi,\gamma}(s_t, a_t) = \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{l=0}^{\infty} \gamma^l \chi(l; s_t, a_t). \tag{27}$$

Using a discount $\gamma < 1$ corresponds to dropping the terms with $l \gg 1/(1-\gamma)$. Thus, the error
introduced by this approximation will be small if $\chi$ rapidly decays as $l$ increases, i.e., if the effect of
an action on rewards is "forgotten" after $\approx 1/(1-\gamma)$ timesteps.
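
To make the effective horizon concrete, the following sketch computes how much of the total geometric weight a given discount assigns to rewards beyond some delay. The helper names are illustrative; the only fact used is that the tail of a geometric series starting at index h carries a fraction of the total weight equal to the discount raised to the power h.

```python
def effective_horizon(discount):
    """Rough timescale 1 / (1 - discount) after which rewards are heavily downweighted."""
    return 1.0 / (1.0 - discount)

def discount_mass_beyond(discount, delay):
    """Fraction of the total weight sum_l discount^l carried by terms with l >= delay."""
    # sum_{l >= delay} d^l / sum_{l >= 0} d^l = d^delay
    return discount ** delay

horizon = effective_horizon(0.99)             # about 100 timesteps
tail_at_5x = discount_mass_beyond(0.99, 500)  # weight left beyond five horizons
```

The same calculation applies to the steeper discount obtained by multiplying the two parameters, which shortens the horizon further.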

If the reward function $\tilde{r}$ were obtained using $\Phi = V^{\pi,\gamma}$, we would have $\mathbb{E}\left[\tilde{r}_{t+l} \mid s_t, a_t\right] =
\mathbb{E}\left[\tilde{r}_{t+l} \mid s_t\right] = 0$ for $l > 0$, i.e., the response function would only be nonzero at $l = 0$. Therefore,
this shaping transformation would turn temporally extended response into an immediate response.
Given that $V^{\pi,\gamma}$ completely reduces the temporal spread of the response function, we can hope that
a good approximation $V \approx V^{\pi,\gamma}$ partially reduces it. This observation suggests an interpretation of
Equation (16): reshape the rewards using $V$ to shrink the temporal extent of the response function,
and then introduce a "steeper" discount $\gamma\lambda$ to cut off the noise arising from long delays, i.e., ignore
terms $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_{t+l}^V$ where $l \gg 1/(1-\gamma\lambda)$.
5 Value Function Estimation

A variety of different methods can be used to estimate the value function (see, e.g., Bertsekas
(2012)). When using a nonlinear function approximator to represent the value function, the
simplest approach is to solve a nonlinear regression problem:

$$\underset{\phi}{\text{minimize}} \; \sum_{n=1}^{N} \left\| V_\phi(s_n) - \hat{V}_n \right\|^2, \tag{28}$$

where $\hat{V}_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$ is the discounted sum of rewards, and $n$ indexes over all timesteps in a
batch of trajectories. This is sometimes called the Monte Carlo or TD(1) approach for estimating
the value function (Sutton & Barto, 1998).^2
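
The Monte Carlo regression targets are just discounted reward sums along each trajectory. A minimal sketch, assuming finite trajectories and illustrative function names:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Target for each timestep t: sum_l gamma^l * r_{t+l} over the rest of the trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def regression_loss(predictions, targets):
    """Sum of squared errors between predicted values and Monte Carlo targets."""
    diff = np.asarray(predictions, dtype=float) - np.asarray(targets, dtype=float)
    return float(np.sum(diff ** 2))
```

In a batch setting these targets would be computed per trajectory and concatenated before fitting the value function.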


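For the experiments in this work, we used a trust region method to optimize the value function
in each iteration of a batch optimization procedure. The trust region helps us to avoid overfitting
to the most recent batch of data. To formulate the trust region problem, we first compute
$\sigma^2 = \frac{1}{N} \sum_{n=1}^{N} \left\| V_{\phi_{\text{old}}}(s_n) - \hat{V}_n \right\|^2$, where $\phi_{\text{old}}$ is the parameter vector before optimization. Then we solve
the following constrained optimization problem:

$$
\begin{aligned}
\underset{\phi}{\text{minimize}} \quad & \sum_{n=1}^{N} \left\| V_\phi(s_n) - \hat{V}_n \right\|^2 \\
\text{subject to} \quad & \frac{1}{N} \sum_{n=1}^{N} \frac{\left\| V_\phi(s_n) - V_{\phi_{\text{old}}}(s_n) \right\|^2}{2\sigma^2} \le \epsilon.
\end{aligned} \tag{29}
$$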


This constraint is equivalent to constraining the average KL divergence between the previous value
function and the new value function to be smaller than $\epsilon$, where the value function is taken to
parameterize a conditional Gaussian distribution with mean $V_\phi(s)$ and variance $\sigma^2$.
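
The equivalence follows from the standard closed form for the KL divergence between two Gaussians that share the same variance; the worked step below uses generic means $\mu_1, \mu_2$ as placeholder symbols.

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_1, \sigma^2) \,\middle\|\, \mathcal{N}(\mu_2, \sigma^2)\right)
  = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2},
\qquad\text{so with } \mu_1 = V_{\phi_{\text{old}}}(s_n),\ \mu_2 = V_\phi(s_n):
\quad
\frac{1}{N}\sum_{n=1}^{N} D_{\mathrm{KL}}
  = \frac{1}{N}\sum_{n=1}^{N} \frac{\left(V_{\phi_{\text{old}}}(s_n) - V_\phi(s_n)\right)^2}{2\sigma^2}.
```

Because the variances are equal, the expression is symmetric in the two means, so the direction of the divergence does not matter here.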

We compute an approximate solution to the trust region problem using the conjugate gradient
algorithm (Wright & Nocedal, 1999). Specifically, we are solving the quadratic program

$$
\begin{aligned}
\underset{\phi}{\text{minimize}} \quad & g^T (\phi - \phi_{\text{old}}) \\
\text{subject to} \quad & \frac{1}{N} \sum_{n=1}^{N} (\phi - \phi_{\text{old}})^T H (\phi - \phi_{\text{old}}) \le \epsilon,
\end{aligned} \tag{30}
$$

where $g$ is the gradient of the objective, and $H = \frac{1}{N} \sum_n j_n j_n^T$, where $j_n = \nabla_\phi V_\phi(s_n)$. Note that
$H$ is the "Gauss-Newton" approximation of the Hessian of the objective, and it is (up to a $\sigma^2$ factor)
the Fisher information matrix when interpreting the value function as a conditional probability
distribution. Using matrix-vector products $v \to Hv$ to implement the conjugate gradient algorithm, we
compute a step direction $s \approx -H^{-1} g$. Then we rescale $s \to \alpha s$ such that $\frac{1}{2} (\alpha s)^T H (\alpha s) = \epsilon$ and
take $\phi = \phi_{\text{old}} + \alpha s$. This procedure is analogous to the procedure we use for updating the policy,
which is described further in Section 6 and based on Schulman et al. (2015).
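
The step computation described above can be sketched with a textbook conjugate gradient loop that touches the curvature matrix only through matrix-vector products. This is an illustrative sketch, not the authors' code: the per-sample gradients are assumed to be available as rows of a `jacobian` array, and the small `damping` term is an added assumption for numerical safety that the text does not mention.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve H x = b using only products v -> H v."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b.copy()              # residual b - H x, with x starting at zero
    p = r.copy()              # current search direction
    rr = float(r @ r)
    for _ in range(iters):
        if rr < tol:
            break
        Hp = matvec(p)
        alpha = rr / float(p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rr_new = float(r @ r)
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def trust_region_step(jacobian, grad, epsilon, damping=1e-8):
    """Return (step, matvec) with step = alpha * s, s ~ -H^{-1} g,
    scaled so that 0.5 * step^T H step = epsilon.

    H = (1/N) * sum_n j_n j_n^T, where row n of `jacobian` is j_n.
    """
    n = jacobian.shape[0]

    def matvec(v):
        # Gauss-Newton product without forming H; damping is an assumption.
        return jacobian.T @ (jacobian @ v) / n + damping * v

    s = conjugate_gradient(matvec, -np.asarray(grad, dtype=float))
    quad = 0.5 * float(s @ matvec(s))
    alpha = np.sqrt(epsilon / quad)
    return alpha * s, matvec
```

The sketch assumes a nonzero gradient; with a zero gradient the direction is zero and the rescaling is undefined.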


6 Experiments 

We designed a set of experiments to investigate the following questions: 

1. What is the empirical effect of varying $\lambda \in [0,1]$ and $\gamma \in [0,1]$ when optimizing episodic total
reward using generalized advantage estimation?

2. Can generalized advantage estimation, along with trust region algorithms for policy and value 
function optimization, be used to optimize large neural network policies for challenging control 
problems? 

^2 Another natural choice is to compute target values with an estimator based on the TD($\lambda$) backup (Bertsekas,
2012; Sutton & Barto, 1998), mirroring the expression we use for policy gradient estimation:
$\hat{V}_t^\lambda = V_{\phi_{\text{old}}}(s_n) + \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$. While we experimented with this choice, we did not notice a difference in performance from
the $\lambda = 1$ estimator in Equation (28).
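6.1 Policy Optimization Algorithm

While generalized advantage estimation can be used along with a variety of different policy
gradient methods, for these experiments, we performed the policy updates using trust region policy
optimization (TRPO) (Schulman et al., 2015). TRPO updates the policy by approximately solving
the following constrained optimization problem each iteration:

$$
\begin{aligned}
\underset{\theta}{\text{minimize}} \quad & L_{\theta_{\text{old}}}(\theta) \\
\text{subject to} \quad & \bar{D}_{\mathrm{KL}}^{\theta_{\text{old}}}(\pi_{\theta_{\text{old}}}, \pi_\theta) \le \epsilon \\
\text{where} \quad & L_{\theta_{\text{old}}}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \frac{\pi_\theta(a_n \mid s_n)}{\pi_{\theta_{\text{old}}}(a_n \mid s_n)} \hat{A}_n, \\
& \bar{D}_{\mathrm{KL}}^{\theta_{\text{old}}}(\pi_{\theta_{\text{old}}}, \pi_\theta) = \frac{1}{N} \sum_{n=1}^{N} D_{\mathrm{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_n) \,\|\, \pi_\theta(\cdot \mid s_n)\right)
\end{aligned} \tag{31}
$$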


As described in Schulman et al. (2015), we approximately solve this problem by linearizing the
objective and quadraticizing the constraint, which yields a step in the direction $\theta - \theta_{\text{old}} \propto -F^{-1} g$,
where $F$ is the average Fisher information matrix, and $g$ is a policy gradient estimate. This policy
update yields the same step direction as the natural policy gradient (Kakade, 2001a) and natural
actor-critic (Peters & Schaal, 2008); however, it uses a different stepsize determination scheme and
numerical procedure for computing the step.

Since prior work (Schulman et al., 2015) compared TRPO to a variety of different policy
optimization algorithms, we will not repeat these comparisons; rather, we will focus on varying the $\gamma$, $\lambda$
parameters of the policy gradient estimator while keeping the underlying algorithm fixed.

For completeness, the whole algorithm for iteratively updating policy and value function is given
below:

    Initialize policy parameter $\theta_0$ and value function parameter $\phi_0$.
    for i = 0, 1, 2, ... do
        Simulate current policy $\pi_{\theta_i}$ until N timesteps are obtained.
        Compute $\delta_t^V$ at all timesteps $t \in \{1, 2, \dots, N\}$, using $V = V_{\phi_i}$.
        Compute $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V$ at all timesteps.
        Compute $\theta_{i+1}$ with TRPO update, Equation (31).
        Compute $\phi_{i+1}$ with Equation (30).
    end for


Note that the policy update $\theta_i \to \theta_{i+1}$ is performed using the value function $V_{\phi_i}$ for advantage
estimation, not $V_{\phi_{i+1}}$. Additional bias would have been introduced if we updated the value function
first. To see this, consider the extreme case where we overfit the value function, and the Bellman
residual $r_t + \gamma V(s_{t+1}) - V(s_t)$ becomes zero at all timesteps: the policy gradient estimate would
be zero.

6.2 Experimental Setup 

We evaluated our approach on the classic cart-pole balancing problem, as well as several challenging 
3D locomotion tasks: (1) bipedal locomotion; (2) quadrupedal locomotion; (3) dynamically standing 
up, for the biped, which starts off lying on its back. The models are shown in Figure 1.

6.2.1 Architecture 

We used the same neural network architecture for all of the 3D robot tasks, which was a feedforward 
network with three hidden layers, with 100, 50 and 25 tanh units respectively. The same architecture 
was used for the policy and value function. The final output layer had linear activation. The value 
function estimator used the same architecture, but with only one scalar output. For the simpler
cart-pole task, we used a linear policy, and a neural network with one 20-unit hidden layer as the value
function.
Figure 1: Top figures: robot models used for 3D locomotion. Bottom figures: a sequence of
frames from the learned gaits. Videos are available at https://sites.google.com/site/gaepapersupp.


6.2.2 Task Details

For the cart-pole balancing task, we collected 20 trajectories per batch, with a maximum length of 
1000 timesteps, using the physical parameters from Barto et al. (1983). 

The simulated robot tasks were simulated using the MuJoCo physics engine (Todorov et al., 2012). 
The humanoid model has 33 state dimensions and 10 actuated degrees of freedom, while the 
quadruped model has 29 state dimensions and 8 actuated degrees of freedom. The initial state 
for these tasks consisted of a uniform distribution centered on a reference configuration. We used 
50000 timesteps per batch for bipedal locomotion, and 200000 timesteps per batch for quadrupedal 
locomotion and bipedal standing. Each episode was terminated after 2000 timesteps if the robot had 
not reached a terminal state beforehand. The timestep was 0.01 seconds. 

The reward functions are provided in the table below.

Task                     Reward
3D biped locomotion      $v_{\text{fwd}} - 10^{-5}\|u\|^2 - 10^{-5}\|f_{\text{impact}}\|^2 + 0.2$
Quadruped locomotion     $v_{\text{fwd}} - 10^{-6}\|u\|^2 - 10^{-3}\|f_{\text{impact}}\|^2 + 0.05$
Biped getting up         $-(h_{\text{head}} - 1.5)^2 - 10^{-5}\|u\|^2$

Here, $v_{\text{fwd}}$ := forward velocity, $u$ := vector of joint torques, $f_{\text{impact}}$ := impact forces, $h_{\text{head}}$ :=
height of the head.

In the locomotion tasks, the episode is terminated if the center of mass of the actor falls below a
predefined height: 0.8 m for the biped, and 0.2 m for the quadruped. The constant offset in the reward
function encourages longer episodes; otherwise the quadratic reward terms might lead to a
policy that ends the episodes as quickly as possible.

6.3 Experimental Results 

All results are presented in terms of the cost, which is defined as negative reward and is
minimized. Videos of the learned policies are available at https://sites.google.com/site/gaepapersupp.
In plots, "No VF" means that we used a time-dependent baseline that did not
depend on the state, rather than an estimate of the state value function. The time-dependent baseline
was computed by averaging the return at each timestep over the trajectories in the batch.
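
A time-dependent baseline of this kind can be computed as below. This is a minimal sketch: the text does not say how trajectories of different lengths were handled, so averaging each timestep only over the trajectories that reach it is an assumption, as is the function name.

```python
import numpy as np

def time_dependent_baseline(returns_per_trajectory):
    """Average return at each timestep over the trajectories that reach that timestep.

    returns_per_trajectory: list of 1-D sequences, one per trajectory, where
    entry t is the return from timestep t onward.
    """
    max_len = max(len(r) for r in returns_per_trajectory)
    totals = np.zeros(max_len)
    counts = np.zeros(max_len)
    for returns in returns_per_trajectory:
        returns = np.asarray(returns, dtype=float)
        totals[:len(returns)] += returns
        counts[:len(returns)] += 1
    # Every index below max_len is reached by at least one trajectory.
    return totals / counts
```

The resulting vector depends only on the timestep index, not on the state, which is what distinguishes it from a learned value function.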

6.3.1 Cart-pole 

The results are averaged across 21 experiments with different random seeds. Results are shown in
Figure 2, and indicate that the best results are obtained at intermediate values of the parameters:
$\gamma \in [0.96, 0.99]$ and $\lambda \in [0.92, 0.99]$.
[Figure 2 panels. Left: learning curves, x-axis "number of policy iterations". Right: "Cart-pole performance after 20 iterations", with $\lambda$ on one axis.]

Figure 2: Left: learning curves for cart-pole task, using generalized advantage estimation with
varying values of $\lambda$ at $\gamma = 0.99$. The fastest policy improvement is obtained by intermediate values of
$\lambda$ in the range [0.92, 0.98]. Right: performance after 20 iterations of policy optimization, as $\gamma$ and $\lambda$
are varied. White means higher reward. The best results are obtained at intermediate values of both.
Figure 3: Left: learning curves for 3D bipedal locomotion, averaged across nine runs of the
algorithm. Right: learning curves for 3D quadrupedal locomotion, averaged across five runs.


6.3.2 3D Bipedal Locomotion

Each trial took about 2 hours to run on a 16-core machine, where the simulation rollouts were
parallelized, as were the function, gradient, and matrix-vector-product evaluations used when optimizing
the policy and value function. Here, the results are averaged across 9 trials with different random
seeds. The best performance is again obtained using intermediate values of $\gamma \in [0.99, 0.995]$, $\lambda \in
[0.96, 0.99]$. The result after 1000 iterations is a fast, smooth gait that is effectively
completely stable. We can compute how much "real time" was used for this learning process:
0.01 seconds/timestep × 50000 timesteps/batch × 1000 batches / (3600 · 24 seconds/day) = 5.8 days. Hence,
it is plausible that this algorithm could be run on a real robot, or multiple real robots learning in
parallel, if there were a way to reset the state of the robot and ensure that it doesn't damage itself.
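
The real-time figure above is simple unit arithmetic; the short sketch below reproduces it from the quantities stated in the text (timestep length, batch size, number of batches). The function name is illustrative.

```python
def simulated_days(seconds_per_timestep, timesteps_per_batch, num_batches):
    """Total simulated time, converted from seconds to days."""
    seconds = seconds_per_timestep * timesteps_per_batch * num_batches
    return seconds / (3600 * 24)

# Values from the bipedal locomotion experiment: 500,000 simulated seconds in total.
days = simulated_days(0.01, 50000, 1000)
```

The result is roughly 5.8 days of simulated experience.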

6.3.3 Other 3D Robot Tasks

The other two motor behaviors considered are quadrupedal locomotion and getting up off the ground
for the 3D biped. Again, we performed 5 trials per experimental condition, with different random
seeds (and initializations). The experiments took about 4 hours per trial on a 32-core machine.
We performed a more limited comparison on these domains (due to the substantial computational
resources required to run these experiments), fixing $\gamma = 0.995$ but varying $\lambda \in \{0, 0.96\}$, as well as
an experimental condition with no value function. For quadrupedal locomotion, the best results are
obtained using a value function with $\lambda = 0.96$, as in Section 6.3.2. For 3D standing, the value function
always helped, but the results are roughly the same for $\lambda = 0.96$ and $\lambda = 1$.
Figure 4: (a) Learning curve for quadrupedal walking, (b) learning curve for 3D standing up, (c)
clips from 3D standing up.


7 Discussion 

Policy gradient methods provide a way to reduce reinforcement learning to stochastic gradient
descent, by providing unbiased gradient estimates. However, so far their success at solving difficult
control problems has been limited, largely due to their high sample complexity. We have argued that 
the key to variance reduction is to obtain good estimates of the advantage function. 

We have provided an intuitive but informal analysis of the problem of advantage function estimation, 
and justified the generalized advantage estimator, which has two parameters $\gamma$, $\lambda$ that adjust the
bias-variance tradeoff. We described how to combine this idea with trust region policy optimization 
and a trust region algorithm that optimizes a value function, both represented by neural networks. 
Combining these techniques, we are able to learn to solve difficult control tasks that have previously 
been out of reach for generic reinforcement learning methods. 
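
To make the role of the two parameters concrete, the following is a minimal Python sketch of the estimator as an exponentially weighted sum of TD residuals. It assumes a single finite trajectory, so the infinite sum is truncated at the end of the batch. The function name `gae_advantages` and the convention that `values` carries one extra bootstrap entry are illustrative assumptions, not a description of the code used in the experiments.

```python
import numpy as np


def gae_advantages(rewards, values, gamma, lam):
    """Truncated generalized advantage estimates for one trajectory.

    rewards: length-T sequence of rewards r_0, ..., r_{T-1}.
    values:  length-(T+1) sequence V(s_0), ..., V(s_T); the last entry is
             the bootstrap value (assumed 0.0 if s_T is terminal).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    horizon = len(rewards)

    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]

    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    advantages = np.zeros(horizon)
    running = 0.0
    for t in reversed(range(horizon)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Illustrative inputs only (not data from the experiments).
rewards = [1.0, 2.0]
values = [0.5, 1.0, 0.0]
one_step = gae_advantages(rewards, values, gamma=0.9, lam=0.0)
full_return = gae_advantages(rewards, values, gamma=0.9, lam=1.0)
```

With `lam=0.0` each estimate reduces to the one-step TD residual, and with `lam=1.0` it equals the discounted sum of remaining rewards minus the value of the current state; intermediate settings trade bias against variance between these two extremes.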

Our main experimental validation of generalized advantage estimation is in the domain of simulated
robotic locomotion. As shown in our experiments, choosing an
appropriate intermediate value of $\lambda$ in the range [0.9, 0.99] usually results in the best performance.
A possible topic for future work is how to adjust the estimator parameters $\gamma, \lambda$ in an adaptive or
automatic way.

One question that merits future investigation is the relationship between value function estimation 
error and policy gradient estimation error. If this relationship were known, we could choose an error
metric for value function fitting that is well-matched to the quantity of interest, which is typically the 
accuracy of the policy gradient estimation. Some candidates for such an error metric might include 
the Bellman error or projected Bellman error, as described in Bhatnagar et al. (2009). 

Another enticing possibility is to use a shared function approximation architecture for the policy and 
the value function, while optimizing the policy using generalized advantage estimation. While
formulating this problem in a way that is suitable for numerical optimization and provides convergence
guarantees remains an open question, such an approach could allow the value function and policy 
representations to share useful features of the input, resulting in even faster learning. 

In concurrent work, researchers have been developing policy gradient methods that involve
differentiation with respect to the continuous-valued action (Lillicrap et al., 2015; Heess et al., 2015). While
we found empirically that the one-step return ($\lambda = 0$) leads to excessive bias and poor performance,
these papers show that such methods can work when tuned appropriately. However, note that those
papers consider control problems with substantially lower-dimensional state and action spaces than
the ones considered here. A comparison between the two classes of approach would be useful for future
work.

A Frequently Asked Questions

A.1 What’s the Relationship with Compatible Features?

Compatible features are often mentioned in relation to policy gradient algorithms that make use 
of a value function, and the idea was proposed in the paper On Actor-Critic Methods by Konda 
& Tsitsiklis (2003). These authors pointed out that due to the limited representation power of the 
policy, the policy gradient only depends on a certain subspace of the space of advantage functions. 
This subspace is spanned by the compatible features $\nabla_{\theta_i} \log \pi_\theta(a_t \mid s_t)$, where $i \in \{1, 2, \dots, \dim \theta\}$.
This theory of compatible features provides no guidance on how to exploit the temporal structure of 
the problem to obtain better estimates of the advantage function, making it mostly orthogonal to the 
ideas in this paper. 

The idea of compatible features motivates an elegant method for computing the natural policy
gradient (Kakade, 2001a; Peters & Schaal, 2008). Given an empirical estimate of the advantage function
$\hat{A}_t$ at each timestep, we can project it onto the subspace of compatible features by solving the
following least squares problem:

$$\operatorname*{minimize}_{\mathbf{r}} \; \sum_t \left\| \mathbf{r} \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) - \hat{A}_t \right\|^2. \tag{32}$$

If $\hat{A}$ is $\gamma$-just, the least squares solution is the natural policy gradient (Kakade, 2001a). Note that
any estimator of the advantage function can be substituted into this formula, including the ones we 
derive in this paper. For our experiments, we also compute natural policy gradient steps, but we use 
the more computationally efficient numerical procedure from Schulman et al. (2015), as discussed 
in Section 6.
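
As a rough illustration of the projection in Equation (32), the sketch below solves the least squares problem with NumPy and compares the result with the solution of the corresponding normal equations, which pair an empirical Fisher matrix with a policy gradient estimate. The score vectors and advantage estimates here are synthetic random stand-ins, and the function name `project_onto_compatible_features` is hypothetical; this is not the numerical procedure used in our experiments.

```python
import numpy as np


def project_onto_compatible_features(score_vectors, advantages):
    """Least squares fit of r so that r . grad log pi(a_t | s_t) ~ A_t.

    score_vectors: array of shape (T, dim_theta), row t holding
                   grad_theta log pi_theta(a_t | s_t).
    advantages:    array of shape (T,) with advantage estimates.
    """
    scores = np.asarray(score_vectors, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    solution, *_ = np.linalg.lstsq(scores, adv, rcond=None)
    return solution


# Synthetic stand-ins for per-timestep score vectors and advantages.
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 3))
adv = rng.normal(size=200)

r = project_onto_compatible_features(scores, adv)

# Normal equations of the same problem: (G^T G / T) r = G^T A / T,
# i.e. an empirical Fisher matrix applied to r equals the gradient estimate.
fisher = scores.T @ scores / len(adv)
gradient = scores.T @ adv / len(adv)
natural_gradient = np.linalg.solve(fisher, gradient)
```

When the score matrix has full column rank, `r` and `natural_gradient` agree up to numerical precision, which is the sense in which the least squares solution yields a natural gradient direction.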

A.2 Why Don’t You Just Use a Q-Function? 

Previous actor-critic methods, e.g. in Konda & Tsitsiklis (2003), use a Q-function to obtain
potentially low-variance policy gradient estimates. Recent papers, including Heess et al. (2015) and Lillicrap
et al. (2015), have shown that a neural network Q-function approximator can be used effectively in a
policy gradient method. However, there are several advantages to using a state-value function in the
manner of this paper. First, the state-value function has a lower-dimensional input and is thus easier
to learn than a state-action value function. Second, the method of this paper allows us to smoothly
interpolate between the high-bias estimator ($\lambda = 0$) and the low-bias estimator ($\lambda = 1$). On the other
hand, using a parameterized Q-function only allows us to use a high-bias estimator. We have found
that the bias is prohibitively large when using a one-step estimate of the returns, i.e., the $\lambda = 0$
estimator, $\hat{A}_t = \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$. We expect that similar difficulty would be encountered
when using an advantage estimator involving a parameterized Q-function, $\hat{A}_t = Q(s, a) - V(s)$.
There is an interesting space of possible algorithms that would use a parameterized Q-function and
attempt to reduce bias; however, an exploration of these possibilities is beyond the scope of this
work.
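
For reference, the two endpoints of this interpolation can be written out explicitly. The block below restates the estimator in terms of the TD residual and specializes it to the two extreme settings; the second case follows because the value terms telescope.

```latex
\begin{align*}
\delta_t^V &= r_t + \gamma V(s_{t+1}) - V(s_t), \\
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} &= \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}^V, \\
\lambda = 0: \quad \hat{A}_t &= \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t), \\
\lambda = 1: \quad \hat{A}_t &= \sum_{l=0}^{\infty} \gamma^l \, \delta_{t+l}^V
  = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t).
\end{align*}
```

At $\lambda = 0$ the estimate depends on the accuracy of $V$ at both $s_t$ and $s_{t+1}$, whereas at $\lambda = 1$ the value function enters only as a baseline.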


B Proofs 

Proof of Proposition 1: First we can split the expectation into terms involving Q and b, 

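$$
\begin{aligned}
&\mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(Q_t(s_{0:\infty}, a_{0:\infty}) + b_t(s_{0:t}, a_{0:t-1})\bigr)\right] \\
&\quad = \mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_t(s_{0:\infty}, a_{0:\infty})\right] \\
&\qquad + \mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b_t(s_{0:t}, a_{0:t-1})\right]
\end{aligned}
\tag{33}
$$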

We’ll consider the terms with Q and b in turn.

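$$
\begin{aligned}
&\mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_t(s_{0:\infty}, a_{0:\infty})\right] \\
&\quad = \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_t(s_{0:\infty}, a_{0:\infty})\right]\right] \\
&\quad = \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left[Q_t(s_{0:\infty}, a_{0:\infty})\right]\right] \\
&\quad = \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi}(s_t, a_t)\right]
\end{aligned}
$$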

Next, 

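$$
\begin{aligned}
&\mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b_t(s_{0:t}, a_{0:t-1})\right] \\
&\quad = \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\mathbb{E}_{s_{t+1:\infty}, a_{t:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b_t(s_{0:t}, a_{0:t-1})\right]\right] \\
&\quad = \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\mathbb{E}_{s_{t+1:\infty}, a_{t:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] b_t(s_{0:t}, a_{0:t-1})\right] \\
&\quad = \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[0 \cdot b_t(s_{0:t}, a_{0:t-1})\right] \\
&\quad = 0.
\end{aligned}
$$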
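
The key step above is that the score function has zero mean under the policy, so any baseline that depends only on earlier states and actions contributes nothing in expectation. The following Python sketch checks this numerically for a small softmax policy over four discrete actions whose parameters are the logits themselves. The logits, the baseline value, and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


def score_vectors(logits):
    """Row a holds grad_logits log pi(a) = e_a - pi for a softmax policy."""
    probs = softmax(logits)
    return np.eye(len(logits)) - probs


# Arbitrary illustrative policy parameters and baseline value.
logits = np.array([0.3, -1.2, 2.0, 0.5])
baseline = 7.5

probs = softmax(logits)
scores = score_vectors(logits)

# E_a[grad log pi(a)] = sum_a pi(a) * grad log pi(a)
expected_score = probs @ scores

# The baseline factors out of the expectation over the current action.
baseline_term = baseline * expected_score
```

Both `expected_score` and `baseline_term` are zero vectors up to floating-point error, mirroring the final two lines of the derivation.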